{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "learning_rate = 10e-4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 随机梯度下降\n",
    "- 为了最小化损失函数，梯度下降计算损失函数对参数向量的梯度，并向梯度下降的方向改变参数；\n",
    "- 随机梯度下降，每一步从训练数据中随机选择一个样本，用来计算梯度\n",
    "$$\\theta\\leftarrow\\theta-\\eta \\nabla_\\theta J(\\theta)$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SGD:\n",
    "    def __init__(self, lr=0.01):\n",
    "        self.lr = lr\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        for key in params.keys():\n",
    "            params[key] -= self.lr * grads[key]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Momentum\n",
    "- 在梯度下降的基础上加入了动量；每一步将本地梯度添加到动量向量 $m$，加速训练\n",
    "\n",
    "$$ m\\leftarrow\\beta m+\\eta \\nabla_\\theta J(\\theta)$$\n",
    "$$\\theta\\leftarrow\\theta-m$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Momentum:\n",
    "    def __init__(self, lr=0.01, momemtum=0.9):\n",
    "        self.lr = lr\n",
    "        self.momemtum = momemtum\n",
    "        self.v = None\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        if self.v is None:\n",
    "            self.v = {}\n",
    "            for key, val in params.items():\n",
    "                self.v[key] = np.zeros_like(val)\n",
    "\n",
    "        for key in params.keys():\n",
    "            self.v[key] = self.momemtum * self.v[key] + self.lr * grads[key]\n",
    "            params[key] -= self.v[key]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = tf.compat.v1.train.MomentumOptimizer(learning_rate=learning_rate,\n",
    "                                                 momentum=0.9,\n",
    "                                                 use_nesterov=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nestrov\n",
    "- Nestrov为了加速收敛，提前按照之前的动量走了一步，然后求导后按梯度再走一步\n",
    "$$m\\leftarrow\\beta m+\\eta \\nabla_\\theta J(\\theta+\\beta m)$$\n",
    "$$\\theta\\leftarrow\\theta-m$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Nestrov:\n",
    "    def __init__(self, lr=0.01, momentum=0.9):\n",
    "        self.lr = lr\n",
    "        self.momentum = momentum\n",
    "        self.v = None\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        if self.v is None:\n",
    "            self.v = {}\n",
    "            for key, val in params.items():\n",
    "                self.v[key] = np.zeros_like(val)\n",
    "\n",
    "        for key in params.keys():\n",
    "            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]\n",
    "            params[key] += self.momentum * self.v[key] - self.lr * grads[key]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adagrad\n",
    "- 自适应优化算法，通过每个参数的历史梯度，动态更新每一个参数的学习率，使得每个参数的更新率都能够逐渐减小。\n",
    "    - 首先将梯度的平方累加进向量 $s$\n",
    "    - 然后梯度向量缩小 $\\sqrt{s+\\epsilon}$，$\\epsilon$ 是平滑项，为了避免除零发生，通常设置为 $10^{-10}$ \n",
    "    - 在较陡的方向，即梯度大的方向，降低更快\n",
    "    $$s\\leftarrow s+\\nabla_\\theta J(\\theta) \\otimes\\nabla_\\theta J(\\theta)$$\n",
    "$$\\theta\\leftarrow\\theta-\\eta\\nabla_\\theta J(\\theta)\\oslash\\sqrt{s+\\epsilon}$$\n",
    "    - 在训练神经网络时，会过早停止训练\n",
    "                   \n",
    "          \n",
    "- 其具体过程如下，对于任一参数 $w$:\n",
    "    $$w^{t+1} = w^t - \\frac{\\eta^t}{\\sigma^t} g^t$$\n",
    "其中：$$\\eta^t = \\eta/\\sqrt{t+1}$$   \n",
    "$$g^t = \\frac{\\partial L(\\theta ^t)}{\\partial w}$$\n",
    "$$\\sigma^t  = \\sqrt{\\frac{1}{t+1}\\sum_{i=0}^{t}(g^i)^2}$$\n",
    "可得：\n",
    "     \n",
    "$\\boxed{w^{t+1} = w^t - \\frac{\\eta}{\\sqrt{\\sum_{i=0}^{t}(g^i)^2}} g^t}$\n",
    "\n",
    "即：\n",
    "    \n",
    "$w^{1} = w^0 - \\frac{\\eta}{\\sqrt{(g^0)^2}} g^0$\n",
    "      \n",
    "$w^{2} = w^1 - \\frac{\\eta}{\\sqrt{[g^0)^2+(g^1)^2]}} g^1$\n",
    "               \n",
    "$w^{3} = w^2 - \\frac{\\eta}{\\sqrt{[g^0)^1+(g^1)^2+(g^2)^2]}} g^2$                   \n",
    "\n",
    "\n",
    "对于二次方程：$y=ax^2+bx+c$，任一点 $x_0$，其最佳步长即为$|x_0+\\frac{b}{2a}|$\n",
    "   \n",
    "由：$$\\left|x_0+\\frac{b}{2a}\\right| = \\left|\\frac{2ax_0+b}{2a}\\right| = \\frac{\\left|\\frac{\\partial y}{\\partial x}\\right|_{x=x_0}}{\\frac{\\partial^2y}{\\partial x^2}|_{x=x_0}} $$\n",
    "可得：任意点$x_0$一阶微分越大，二阶微分越小，其最佳步长越大\n",
    "\n",
    "\n",
    "失函数 $L$ 的二阶微分可以近似表示成一阶微分的和$\\sqrt{\\sum_{i=0}^{t}(g^i)^2}$。故$\\boxed{\\frac{\\eta}{\\sqrt{\\sum_{i=0}^{t}(g^i)^2}} g^t}$近似表示每次更新时的最佳步长"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AdaGrad:\n",
    "    def __init__(self, lr=0.01):\n",
    "        self.lr = lr\n",
    "        self.h = None\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        if self.h is None:\n",
    "            self.h = {}\n",
    "            for key, val in params.items():\n",
    "                self.h[key] = np.zeros_like(val)\n",
    "\n",
    "        for key in params.keys():\n",
    "            self.h[key] += grads[key] * grads[key]\n",
    "            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001,\n",
    "                                        initial_accumulator_value=0.1,\n",
    "                                        epsilon=1e-07,\n",
    "                                        name='Adagrad')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RMSprop\n",
    "- `AdaGrad` 学习率会不断地衰退，在达到最优解之前学习率就已经太小了\n",
    "- `RMSprop` 采用了使用指数衰减平均来慢慢丢弃先前的梯度历史，只累计最近迭代时的梯度，能够防止学习率过早地减小\n",
    "$$s\\leftarrow\\beta s+(1-\\beta)\\nabla_\\theta J(\\theta) \\otimes\\nabla_\\theta J(\\theta)$$\n",
    "$$\\theta\\leftarrow\\theta-\\eta\\nabla_\\theta J(\\theta)\\oslash\\sqrt{s+\\epsilon}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RMSprop:\n",
    "    def __init__(self, lr=0.01, decay_rate=0.99):\n",
    "        self.lr = lr\n",
    "        self.decay_rate = decay_rate\n",
    "        self.h = None\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        if self.h is None:\n",
    "            self.h = {}\n",
    "            for key, val in params.items():\n",
    "                self.h[key] = np.zeros_like(val)\n",
    "\n",
    "        for key in params.keys():\n",
    "            self.h[key] *= self.decay_rate\n",
    "            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]\n",
    "            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001,\n",
    "                            rho=0.9,\n",
    "                            momentum=0.0,\n",
    "                            epsilon=1e-07,\n",
    "                            centered=False,\n",
    "                            name='RMSprop')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adam \n",
    "- `Adam` 组合了 `Momentum` 和 `RMSProp` 两种算法思想\n",
    "$$m\\leftarrow\\beta_1 m+(1-\\beta_1)\\nabla_\\theta J(\\theta)$$\n",
    "$$s\\leftarrow\\beta_2 s+(1-\\beta_2)\\nabla_\\theta J(\\theta) \\otimes\\nabla_\\theta J(\\theta)$$\n",
    "$$m\\leftarrow\\frac{m}{1-\\beta_1^T}$$\n",
    "$$s\\leftarrow\\frac{s}{1-\\beta_2^T}$$\n",
    "$$\\theta\\leftarrow\\theta-\\eta m\\oslash\\sqrt{s+\\epsilon}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Adam:\n",
    "    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):\n",
    "        self.lr = lr\n",
    "        self.beta1 = beta1\n",
    "        self.beta2 = beta2\n",
    "        self.iter = 0\n",
    "        self.m = None\n",
    "        self.v = None\n",
    "\n",
    "    def update(self, params, grads):\n",
    "        if self.m is None:\n",
    "            self.m, self.v = {}, {}\n",
    "            for key, val in params.items():\n",
    "                self.m[key] = np.zeros_like(val)\n",
    "                self.v[key] = np.zeros_like(val)\n",
    "\n",
    "        self.iter += 1\n",
    "        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (\n",
    "            1.0 - self.beta1**self.iter)\n",
    "\n",
    "        for key in params.keys():\n",
    "            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])\n",
    "            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])\n",
    "\n",
    "            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,\n",
    "                                     beta_1=0.9,\n",
    "                                     beta_2=0.999,\n",
    "                                     epsilon=1e-07,\n",
    "                                     amsgrad=False,\n",
    "                                     name='Adam')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
